"Influence sketching": Finding influential samples in large-scale regressions
Abstract
There is an especially strong need in modern large-scale data analysis to prioritize samples for manual inspection. For example, the inspection could target important mislabeled samples or key vulnerabilities exploitable by an adversarial attack. To solve the “needle in the haystack” problem of which samples to inspect, we develop a new scalable version of Cook’s distance, a classical statistical technique for identifying samples that have an unusually strong impact on the fit of a regression model (and its downstream predictions). To scale this technique up to very large and high-dimensional datasets, we introduce a new algorithm which we call “influence sketching.” Influence sketching embeds random projections within the influence computation; in particular, the influence score is calculated using the randomly projected pseudo-dataset from the post-convergence Generalized Linear Model (GLM). We validate that influence sketching can reliably discover influential samples by applying the technique to a malware detection dataset of over 2 million executable files, each represented with almost 100,000 features. For example, we find that randomly deleting approximately 10% of training samples reduces predictive accuracy only slightly, from 99.47% to 99.45%, whereas deleting the same number of samples with high influence sketch scores reduces predictive accuracy all the way down to 90.24%. Moreover, we find that influential samples are especially likely to be mislabeled. In the case study, we manually inspect the most influential samples and find that influence sketching pointed us to new, previously unidentified pieces of malware.
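The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes logistic regression as the GLM, a dense Gaussian random projection, and the standard one-step approximation to Cook's distance for GLMs. The function names `fit_logistic_irls` and `influence_sketch`, the sketch dimension `k`, and the small `ridge` term are all hypothetical choices made for illustration.

```python
import numpy as np


def fit_logistic_irls(X, y, n_iter=25, ridge=1e-6):
    """Fit logistic regression with Newton/IRLS steps (illustrative helper)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))      # fitted probabilities
        w = mu * (1.0 - mu)                       # IRLS weights
        grad = X.T @ (y - mu) - ridge * beta
        hess = X.T @ (X * w[:, None]) + ridge * np.eye(p)
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def influence_sketch(X, y, beta, k, rng):
    """Approximate Cook's-distance-style scores from a randomly projected
    pseudo-dataset (assumed form: r_i^2 * h_i / (k * (1 - h_i)^2))."""
    n, p = X.shape
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    w = np.clip(mu * (1.0 - mu), 1e-12, None)

    # Post-convergence pseudo-dataset: rows of X scaled by sqrt of the weights.
    X_pseudo = X * np.sqrt(w)[:, None]

    # Gaussian random projection from p features down to k (an assumption).
    omega = rng.standard_normal((p, k)) / np.sqrt(k)
    Z = X_pseudo @ omega

    # Leverage of each row of the projected matrix via a reduced QR factorization.
    Q, _ = np.linalg.qr(Z)
    h = np.clip(np.sum(Q ** 2, axis=1), 0.0, 1.0 - 1e-12)

    # Pearson residuals come from the full (unprojected) fitted model.
    r = (y - mu) / np.sqrt(w)
    return r ** 2 * h / (k * (1.0 - h) ** 2)
```

Samples would then be ranked by score, highest first, to choose which ones to inspect manually. When `k` equals the number of features and the projection matrix is invertible, the projected leverages coincide with the exact ones, so the sketch reduces to the unprojected one-step Cook's distance.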
Similar resources
Large Scale Metagenomic Sequence Clustering via Sketching and Maximal Quasi-clique Enumeration on Map-Reduce Clusters
Taxonomic clustering of species from millions of DNA fragments sequenced from their genomes is an important and frequently arising problem in metagenomics. High-throughput next generation sequencing is enabling the creation of large metagenomic samples, while at the same time making the clustering problem harder due to the short sequence length supported and sampling of hitherto unknown species...
Finding influential individual in Social Network graphs using CSCS algorithm and Shapley value in game theory
In recent years, social network analysis has gained a great deal of attention. Social networks have various applications in different areas, namely predicting disease epidemics, search engines and viral advertisements. A key property of social networks is that interpersonal relationships can influence the decisions that individuals make. Finding the most influential nodes is important in social networks b...
Centrality Measures, Upper Bound, and Influence Maximization in Large Scale Directed Social Networks
The paper addresses the problem of finding top k influential nodes in large scale directed social networks. We propose two new centrality measures, Diffusion Degree for independent cascade model of information diffusion and Maximum Influence Degree. Unlike other existing centrality measures, diffusion degree considers neighbors’ contributions in addition to the degree of a node. The measure als...
Large-eddy simulation of turbulent flow over an array of wall-mounted cubes submerged in an emulated atmospheric boundary-layer
Turbulent flow over an array of wall-mounted cubic obstacles has been numerically investigated using large-eddy simulation. The simulations have been performed using high-performance computations with local cluster systems. The array of cubes is fully submerged in a simulated deep rough-wall atmospheric boundary-layer with high turbulence intensity characteristics of environmental turbulent fl...
Cliques Role in Organizational Reputational Influence: A Social Network Analysis
Empirical support for the assumption that cliques are major determinants of reputational influence derives largely from the frequent finding that organizations which claimed that their cliques’ connections are influential had an increased likelihood of becoming influential themselves. It is suggested that the strong and consistent connection in cliques is at least partially responsible for the ...
Publication date: 2016